movielens_top40.csv from Canvas
author_count.csv from Canvas
We will be analysing the MovieLens dataset, which contains ratings of 58,000 movies by 280,000 users. The entire dataset is too big for us to work with in this lab, so it has been preprocessed and only a small subset of the data is considered. If you want to do more exploration yourself, the entire dataset can be downloaded here.
This part of the lab is based on a chapter in an online book by
Rafael Irizarry. You can find it here. There are lots of
examples in this book to show you how to use R for data
science.
This part of the code is for interested students only. You do not need this for the lab.
# Here is the code used to preprocess the data (taken from the Irizarry lab):
library(dplyr)
library(tidyr)
ratings <- read.csv("ml-latest-small/ratings.csv", header = TRUE)
movies <- read.csv("ml-latest-small/movies.csv", header = TRUE)
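# Join on the shared key explicitly (this is the column dplyr would pick by default)
movielens <- left_join(movies, ratings, by = "movieId")
# IDs of the 40 movies with the most ratings
top <- movielens %>%
  group_by(movieId) %>%
  summarize(n = n(), title = first(title)) %>%
  top_n(40, n) %>%
  pull(movieId)
# Keep users who rated at least 20 of those movies, then reshape to movies x users
x <- movielens %>%
  filter(movieId %in% top) %>%
  group_by(userId) %>%
  filter(n() >= 20) %>%
  ungroup() %>%
  select(title, userId, rating) %>%
  spread(userId, rating)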
x <- as.data.frame(x)
rownames(x) <- x$title
x$title <- NULL
colnames(x) <- paste0("user_", colnames(x))
write.table(x, row.names = TRUE, col.names = TRUE, sep = ",", file = "movielens_top40.csv")
Load the data movielens_top40.csv into R.
It contains the 40 movies with the most ratings and the users who rated
at least 20 of those 40 movies. Note that IDA refers to initial data
analysis, which is an important component of all data analytics.
movielens <- read.csv("movielens_top40.csv", header = TRUE)
dim(movielens)
## [1] 40 153
print(movielens[1:5,1:5])
## user_1 user_6 user_7 user_15 user_17
## Aladdin (1992) NA 5 3.0 3 NA
## American Beauty (1999) 5 NA 4.0 4 4.0
## Apollo 13 (1995) NA 4 4.5 NA 3.5
## Back to the Future (1985) 5 NA 5.0 5 4.5
## Batman (1989) 4 3 3.0 NA 4.5
head(movielens)
## user_1 user_6 user_7 user_15 user_17 user_18 user_19
## Aladdin (1992) NA 5 3.0 3 NA 3.5 3
## American Beauty (1999) 5 NA 4.0 4 4.0 NA 4
## Apollo 13 (1995) NA 4 4.5 NA 3.5 NA NA
## Back to the Future (1985) 5 NA 5.0 5 4.5 4.0 4
## Batman (1989) 4 3 3.0 NA 4.5 NA 5
## Braveheart (1995) 4 5 NA NA 4.5 4.5 NA
## user_21 user_28 user_39 user_42 user_45 user_57
## Aladdin (1992) 4.0 NA 4 NA 5.0 4
## American Beauty (1999) 2.0 4.0 5 NA 5.0 5
## Apollo 13 (1995) NA NA NA 5 5.0 3
## Back to the Future (1985) 5.0 NA 4 4 3.5 4
## Batman (1989) 3.5 2.5 4 3 NA 4
## Braveheart (1995) NA 3.5 NA 4 5.0 4
## user_58 user_62 user_63 user_64 user_66 user_68
## Aladdin (1992) 5 NA 4.0 4.0 NA 3.5
## American Beauty (1999) NA NA 5.0 2.5 5 5.0
## Apollo 13 (1995) 4 NA 3.0 NA NA 3.0
## Back to the Future (1985) NA 4.5 5.0 NA 3 3.0
## Batman (1989) 3 NA 4.0 NA 4 4.0
## Braveheart (1995) 5 4.5 2.5 4.0 5 2.5
## user_72 user_82 user_84 user_86 user_91 user_96
## Aladdin (1992) NA 2.5 NA 4 3.5 NA
## American Beauty (1999) 4.5 NA NA 4 NA 5
## Apollo 13 (1995) 4.0 NA 5 NA 3.5 5
## Back to the Future (1985) 4.0 4.0 3 NA 3.5 NA
## Batman (1989) NA 3.5 3 NA 5.0 NA
## Braveheart (1995) 4.5 4.5 NA NA 4.0 5
## user_103 user_105 user_109 user_112 user_115 user_117
## Aladdin (1992) NA NA 3 NA 4 4
## American Beauty (1999) NA 5.0 NA NA 1 NA
## Apollo 13 (1995) 4.0 NA 3 4.0 NA 4
## Back to the Future (1985) NA NA NA 4.0 NA NA
## Batman (1989) NA NA 4 NA 5 3
## Braveheart (1995) 4.5 3.5 5 3.5 3 5
## user_122 user_132 user_135 user_137 user_140 user_141
## Aladdin (1992) NA 3.5 NA 4.0 3 4.0
## American Beauty (1999) NA 4.5 4 NA 4 NA
## Apollo 13 (1995) NA NA NA 3.5 5 3.5
## Back to the Future (1985) 5.0 3.5 NA 3.5 3 2.5
## Batman (1989) 4.5 2.0 5 NA NA NA
## Braveheart (1995) NA NA 4 4.0 4 3.5
## user_144 user_156 user_160 user_166 user_167 user_177
## Aladdin (1992) 4.5 NA NA 5.0 3.0 4
## American Beauty (1999) 4.0 4.5 5 4.0 3.0 4
## Apollo 13 (1995) 3.0 4.0 5 NA 4.0 4
## Back to the Future (1985) NA 3.5 5 NA NA 5
## Batman (1989) 3.5 NA 4 3.5 3.0 3
## Braveheart (1995) 4.5 NA 4 NA 3.5 NA
## user_178 user_179 user_182 user_186 user_187 user_195
## Aladdin (1992) NA NA NA 5 NA NA
## American Beauty (1999) 5.0 NA 5.0 NA 4 4
## Apollo 13 (1995) NA 4 2.5 NA NA 4
## Back to the Future (1985) 4.5 NA 3.0 NA NA 5
## Batman (1989) NA 3 3.5 4 NA NA
## Braveheart (1995) 4.0 5 3.5 NA 3 NA
## user_198 user_199 user_200 user_201 user_202 user_212
## Aladdin (1992) NA NA 4.0 NA 4 NA
## American Beauty (1999) 5 5 3.5 5 4 3.5
## Apollo 13 (1995) NA 4 4.0 4 4 NA
## Back to the Future (1985) 5 NA 4.0 5 4 NA
## Batman (1989) 3 3 NA 3 3 NA
## Braveheart (1995) 3 NA 4.5 NA 4 NA
## user_217 user_219 user_220 user_226 user_230 user_232
## Aladdin (1992) NA 4.5 5 4.0 2 3.0
## American Beauty (1999) NA 5.0 NA 4.0 NA NA
## Apollo 13 (1995) NA 4.0 5 4.5 2 4.5
## Back to the Future (1985) 3 3.5 5 4.0 NA 3.0
## Batman (1989) 2 3.5 NA NA 3 NA
## Braveheart (1995) 2 NA NA NA NA 4.5
## user_233 user_239 user_247 user_249 user_254 user_263
## Aladdin (1992) NA 4.0 5 4.0 NA NA
## American Beauty (1999) 3 5.0 4 4.5 5.0 4
## Apollo 13 (1995) 2 NA 3 2.5 4.0 4
## Back to the Future (1985) NA NA 4 4.5 3.5 NA
## Batman (1989) NA NA NA NA 2.5 NA
## Braveheart (1995) 3 4.5 4 5.0 4.0 4
## user_266 user_274 user_275 user_279 user_282 user_288
## Aladdin (1992) NA 4.0 NA 2.0 4.5 4
## American Beauty (1999) NA 5.0 4 3.5 4.5 NA
## Apollo 13 (1995) NA NA NA NA 4.5 3
## Back to the Future (1985) 4 3.5 4 3.5 5.0 5
## Batman (1989) 4 3.0 NA NA 3.5 3
## Braveheart (1995) 5 4.5 NA 4.0 NA 5
## user_292 user_298 user_304 user_305 user_307 user_308
## Aladdin (1992) 4.0 NA 4 NA 4.0 NA
## American Beauty (1999) NA 4.0 2 5.0 4.0 NA
## Apollo 13 (1995) NA NA 5 NA 2.0 NA
## Back to the Future (1985) 4.0 3.5 5 5.0 4.0 NA
## Batman (1989) 3.5 3.5 NA 2.5 4.0 NA
## Braveheart (1995) 2.5 3.0 5 NA 3.5 1
## user_313 user_314 user_317 user_318 user_322 user_328
## Aladdin (1992) NA 3 NA NA NA 3.5
## American Beauty (1999) 4 NA 5 3.5 4.5 NA
## Apollo 13 (1995) NA 4 3 NA 4.0 3.0
## Back to the Future (1985) 2 NA NA 2.5 NA 4.0
## Batman (1989) 5 3 NA NA NA 2.0
## Braveheart (1995) NA 4 5 NA 3.5 1.0
## user_330 user_332 user_334 user_339 user_352 user_354
## Aladdin (1992) 3.0 NA NA NA NA 3.5
## American Beauty (1999) 4.5 4.5 NA 5.0 5 4.0
## Apollo 13 (1995) 3.0 3.5 NA 4.0 NA 4.0
## Back to the Future (1985) 4.0 4.0 3.5 4.0 NA 4.0
## Batman (1989) 4.0 NA NA 2.5 NA 4.0
## Braveheart (1995) 3.5 3.5 NA NA NA NA
## user_357 user_362 user_368 user_370 user_372 user_376
## Aladdin (1992) 4.5 NA NA NA 4 NA
## American Beauty (1999) 3.5 NA 4 3.5 NA NA
## Apollo 13 (1995) 3.5 NA NA NA 3 5.0
## Back to the Future (1985) 4.0 NA NA NA 5 4.5
## Batman (1989) 3.0 4.5 3 4.0 3 NA
## Braveheart (1995) 4.0 4.0 4 NA 4 3.5
## user_380 user_381 user_382 user_385 user_387 user_391
## Aladdin (1992) 5 4.0 5 4 2.5 NA
## American Beauty (1999) NA NA NA NA 4.5 4
## Apollo 13 (1995) NA 3.5 4 5 NA 4
## Back to the Future (1985) 5 4.0 NA 4 2.0 4
## Batman (1989) 3 NA NA 3 4.0 4
## Braveheart (1995) 4 NA NA NA 3.5 5
## user_399 user_414 user_415 user_425 user_428 user_432
## Aladdin (1992) NA 4 4.0 3.0 2.0 NA
## American Beauty (1999) 0.5 5 3.5 3.0 3.5 3.5
## Apollo 13 (1995) NA 4 4.0 3.0 2.0 NA
## Back to the Future (1985) 5.0 5 NA NA NA NA
## Batman (1989) NA 4 NA 3.5 3.0 NA
## Braveheart (1995) 3.0 5 NA 4.0 2.5 4.0
## user_434 user_438 user_448 user_452 user_453 user_462
## Aladdin (1992) 4.0 4.0 NA NA 5 NA
## American Beauty (1999) 5.0 NA 4 4 5 3.5
## Apollo 13 (1995) 5.0 4.0 3 NA NA NA
## Back to the Future (1985) 3.5 4.0 5 4 NA 1.5
## Batman (1989) NA 4.0 3 5 NA 3.0
## Braveheart (1995) 4.5 4.5 NA 5 5 NA
## user_464 user_469 user_470 user_474 user_477 user_480
## Aladdin (1992) NA 2 3 4.0 3.0 4.0
## American Beauty (1999) 4 5 NA 3.5 4.5 4.0
## Apollo 13 (1995) NA NA 3 4.5 4.0 3.5
## Back to the Future (1985) NA 3 NA 4.5 4.5 5.0
## Batman (1989) NA 3 3 4.0 NA 4.5
## Braveheart (1995) 5 5 5 3.0 NA 5.0
## user_483 user_489 user_514 user_517 user_522 user_524
## Aladdin (1992) 4.0 3.5 4.0 3.0 4 4
## American Beauty (1999) 4.0 4.0 4.0 1.0 5 NA
## Apollo 13 (1995) 2.0 3.5 4.0 NA NA 5
## Back to the Future (1985) 4.5 3.5 5.0 5.0 5 5
## Batman (1989) 3.5 4.0 2.5 3.0 NA 3
## Braveheart (1995) 4.0 4.5 NA 1.5 4 3
## user_525 user_534 user_551 user_555 user_559 user_560
## Aladdin (1992) 3.5 4.5 NA NA 4 NA
## American Beauty (1999) 4.0 3.5 NA 5 NA 4
## Apollo 13 (1995) 4.0 NA NA 4 3 4
## Back to the Future (1985) 4.0 5.0 4.0 3 NA NA
## Batman (1989) NA 4.0 NA 3 3 NA
## Braveheart (1995) NA NA 3.5 5 4 4
## user_561 user_562 user_570 user_573 user_577 user_580
## Aladdin (1992) NA 4 NA 4.5 NA 2.0
## American Beauty (1999) 3.5 5 4.0 2.0 NA 5.0
## Apollo 13 (1995) NA 3 4.0 3.0 NA NA
## Back to the Future (1985) 4.5 NA 4.0 4.5 5 3.5
## Batman (1989) 4.5 NA NA 4.5 2 3.0
## Braveheart (1995) 5.0 4 3.5 5.0 4 4.5
## user_586 user_590 user_593 user_594 user_596 user_597
## Aladdin (1992) 4.5 4.0 3.5 4.5 NA 4
## American Beauty (1999) NA 3.0 4.5 NA NA 5
## Apollo 13 (1995) NA 4.5 3.0 3.5 3.5 NA
## Back to the Future (1985) 4.5 4.5 NA NA 4.0 5
## Batman (1989) NA 3.5 NA 4.5 3.5 4
## Braveheart (1995) 5.0 4.0 3.0 5.0 NA 5
## user_599 user_600 user_602 user_603 user_606 user_607
## Aladdin (1992) 3.0 3.5 NA NA NA NA
## American Beauty (1999) 5.0 4.5 NA 5 4.5 3
## Apollo 13 (1995) 2.5 2.0 4 NA NA 5
## Back to the Future (1985) 3.5 4.5 NA 2 3.5 3
## Batman (1989) 3.5 2.5 4 2 3.5 3
## Braveheart (1995) 3.5 2.0 5 1 3.5 5
## user_608 user_610
## Aladdin (1992) 3 NA
## American Beauty (1999) 5 3.5
## Apollo 13 (1995) 2 NA
## Back to the Future (1985) 2 5.0
## Batman (1989) 3 4.5
## Braveheart (1995) 4 4.5
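As one possible starting point for the IDA mentioned above, the sketch below summarises how much of a ratings table is missing and how many ratings each movie and user has. It runs on a small made-up table called toy (an assumption for illustration, not part of the lab data); in the lab you would apply the same calls to movielens.

```r
# Toy stand-in for the movielens data frame (movies in rows, users in columns).
# The values are made up; in the lab, use the `movielens` object instead.
set.seed(1)
toy <- matrix(sample(seq(0.5, 5, by = 0.5), 12 * 8, replace = TRUE), nrow = 12,
              dimnames = list(paste("Movie", 1:12), paste0("user_", 1:8)))
toy[sample(13:96, 15)] <- NA   # blank out 15 ratings outside the first column
toy <- as.data.frame(toy)

# Overall proportion of missing ratings
prop_missing <- mean(is.na(toy))

# Number of observed ratings per movie (rows) and per user (columns)
ratings_per_movie <- rowSums(!is.na(toy))
ratings_per_user <- colSums(!is.na(toy))

# Range of the observed ratings and the mean rating of each movie
rating_range <- range(as.matrix(toy), na.rm = TRUE)
mean_rating <- rowMeans(toy, na.rm = TRUE)

print(prop_missing)
print(ratings_per_movie)
print(rating_range)
```

Checks like these show early on that the table is sparse, which matters later because some methods (such as kmeans) cannot handle missing values.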
Given the large number of variables, a natural high-dimensional visualization method is to cluster the movies based on the different user ratings. We will look at how to do this in R.
hclust usage
Perform hierarchical clustering using the hclust()
function and plot the resulting dendrogram. Try it with the
average, complete and single
methods.
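A minimal sketch of this step is shown below, assuming Euclidean distances from dist() as the input to hclust(). It uses a small made-up table called toy rather than the lab data, so treat it as an illustration of the calls and not as a model answer; replace toy with movielens in the lab.

```r
# Toy stand-in for the movielens data frame (movies in rows, users in columns).
# Continuous made-up values are used; the first user is kept complete so that
# every pair of movies shares at least one rating and dist() never returns NA.
set.seed(1)
toy <- matrix(runif(12 * 8, min = 0.5, max = 5), nrow = 12,
              dimnames = list(paste("Movie", 1:12), paste0("user_", 1:8)))
toy[sample(13:96, 15)] <- NA
toy <- as.data.frame(toy)

# Pairwise Euclidean distances between movies
d <- dist(toy)

# Fit one tree per linkage method
methods <- c("average", "complete", "single")
fits <- lapply(methods, function(m) hclust(d, method = m))
names(fits) <- methods

# Plot the three dendrograms side by side
op <- par(mfrow = c(1, 3))
for (m in methods) {
  plot(fits[[m]], main = paste(m, "linkage"), xlab = "", sub = "", cex = 0.7)
}
par(op)
```

Comparing the three panels shows how strongly the tree shape depends on the linkage method; single linkage in particular tends to produce long chains.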
hclust
Use the cutree() function on the output of
hclust() (with default settings) to separate the movie
titles into four clusters. Can you extract the movies in cluster 1? We
can also cut the tree by defining a height at which the tree should be
cut. Can you find the value of h to cut the tree into four
clusters?
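The sketch below shows one way to approach both parts, again on a made-up table called toy (an assumption for illustration). The height is read off the height component of the hclust() output: with n movies, any h between the (n - 4)th and (n - 3)th merge heights leaves four clusters, and the midpoint is used here as one convenient choice.

```r
# Toy stand-in for the movielens data frame (movies in rows, users in columns);
# continuous made-up values avoid tied merge heights.
set.seed(1)
toy <- matrix(runif(12 * 8, min = 0.5, max = 5), nrow = 12,
              dimnames = list(paste("Movie", 1:12), paste0("user_", 1:8)))
toy[sample(13:96, 15)] <- NA   # the first user stays complete
toy <- as.data.frame(toy)

# Hierarchical clustering with default settings (complete linkage)
h <- hclust(dist(toy))

# Cut into four clusters by number of clusters
clusters_k <- cutree(h, k = 4)

# Movies assigned to cluster 1
cluster1_movies <- names(clusters_k)[clusters_k == 1]
print(cluster1_movies)

# Cut by height instead: pick a height between the merge that leaves
# four clusters and the merge that would reduce them to three
n <- nrow(toy)
k <- 4
h_cut <- mean(h$height[c(n - k, n - k + 1)])
clusters_h <- cutree(h, h = h_cut)

# Both cuts should give the same partition
print(table(clusters_k, clusters_h))
```

On the dendrogram, h_cut corresponds to a horizontal line that crosses exactly four vertical branches.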
Use cutree with this h to find 4 clusters and compare to your result in
the previous question.
R also offers a number of packages that enable the user to visualize
the data together with the clustering tree. We call these visualizations
“heatmaps” of the data matrix. Download and install the package
ComplexHeatmap using the code provided below. We will also
need to ensure the input is a matrix, as expected by the function
Heatmap. The arguments row_names_gp and
column_names_gp enable us to reduce the font size.
# BiocManager::install("ComplexHeatmap")
# BiocManager::install("shape")
library(ComplexHeatmap)
movielens_matrix <- as.matrix(movielens)
Heatmap(movielens_matrix,
        row_names_gp = gpar(fontsize = 7),
        column_names_gp = gpar(fontsize = 7))
Suppose we would like to compare two trees and visualize the comparison.
R has a package called dendextend that compares
two dendrograms. It has the following key functions:
- untangle(): finds an alignment,
- tanglegram(): visualizes the two dendrograms,
- entanglement(): computes the quality of the alignment.
library(dendextend)
d <- dist(movielens)
# Create two dendrograms
h_avg <- hclust(d, method = "average")
h_single <- hclust(d, method = "single")
dend1 <- as.dendrogram(h_avg)
dend2 <- as.dendrogram(h_single)
# Create a list to hold dendrograms
dend_list <- dendlist(dend1, dend2)
# Compare the two trees
tanglegram(dend_list)
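The other two functions listed above can be tried in the same way. The sketch below runs on a made-up table called toy (an assumption for illustration) and uses the "step1side" method of untangle(), which is only one of several options; a lower entanglement value indicates a better alignment.

```r
library(dendextend)

# Toy stand-in for the movielens data frame (movies in rows, users in columns)
set.seed(1)
toy <- matrix(runif(12 * 8, min = 0.5, max = 5), nrow = 12,
              dimnames = list(paste("Movie", 1:12), paste0("user_", 1:8)))
toy[sample(13:96, 15)] <- NA   # the first user stays complete
toy <- as.data.frame(toy)

# Two dendrograms from the same distances but different linkage methods
d <- dist(toy)
dend_list <- dendlist(as.dendrogram(hclust(d, method = "average")),
                      as.dendrogram(hclust(d, method = "single")))

# Quality of the alignment before untangling (0 = best, 1 = worst)
ent_before <- entanglement(dend_list)

# Rotate branches to try to improve the alignment, then measure again
untangled <- untangle(dend_list, method = "step1side")
ent_after <- entanglement(untangled)
print(c(before = ent_before, after = ent_after))

# Draw the untangled pair
tanglegram(untangled)
```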
Next, let’s explore the kmeans method. Go back to the
original movies dataset, which has ratings between 0.5 and 5 and missing values.
Make a new dataset in which all the NAs are replaced with 0 and the observed
ratings are kept. We are doing this because the kmeans function
cannot handle missing values. In a later module, we will look at how to
handle missing values. Use kmeans to cluster the movies
into four clusters. How many movies are in each cluster?
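A minimal sketch of these steps follows, using a made-up table called toy (an assumption for illustration) in place of movielens. The nstart = 20 setting and the final PCA plot via prcomp() are choices made for this sketch, not requirements of the lab.

```r
# Toy stand-in for the movielens data frame (movies in rows, users in columns)
set.seed(1)
toy <- matrix(runif(20 * 8, min = 0.5, max = 5), nrow = 20,
              dimnames = list(paste("Movie", 1:20), paste0("user_", 1:8)))
toy[sample(length(toy), 30)] <- NA
toy <- as.data.frame(toy)

# Replace missing ratings with 0 and keep the observed ratings
toy0 <- as.matrix(toy)
toy0[is.na(toy0)] <- 0

# k-means with four clusters; several random starts reduce the risk of
# ending in a poor local optimum
km <- kmeans(toy0, centers = 4, nstart = 20)

# Number of movies in each cluster
print(table(km$cluster))

# One way to look at the result: project the movies onto the first two
# principal components and colour the points by cluster
pc <- prcomp(toy0)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means clusters on the first two PCs")
```

Keep in mind that coding a missing rating as 0 treats "not rated" as a very low score, so the clusters partly reflect who rated what rather than only how movies were rated.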
To visualize the kmeans clustering, use a
dimension reduction technique such as PCA.
Let’s now look at the cluster statistics. Can you plot the total
within group sum of squares for k = 2, 3, 4, 5, 6 from
kmeans()? The tot.withinss is part of the
output value of kmeans. Repeat for the between group sum of
squares (betweenss). Do the plots hint at what is the best
k?
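The loop over k can be sketched as below, again on a made-up table called toy0 (an assumption for illustration, already free of missing values). The two statistics always add up to the total sum of squares, so the two curves mirror each other.

```r
# Toy stand-in for the NA-replaced ratings matrix (movies in rows, users in columns)
set.seed(1)
toy0 <- matrix(runif(20 * 8, min = 0.5, max = 5), nrow = 20,
               dimnames = list(paste("Movie", 1:20), paste0("user_", 1:8)))
toy0[sample(length(toy0), 30)] <- 0

# Fit k-means for each candidate number of clusters
ks <- 2:6
fits <- lapply(ks, function(k) kmeans(toy0, centers = k, nstart = 20))

# Extract the two cluster statistics from each fit
wss <- sapply(fits, function(f) f$tot.withinss)
bss <- sapply(fits, function(f) f$betweenss)

# Plot both curves against k
op <- par(mfrow = c(1, 2))
plot(ks, wss, type = "b", xlab = "k", ylab = "tot.withinss",
     main = "Total within group SS")
plot(ks, bss, type = "b", xlab = "k", ylab = "betweenss",
     main = "Between group SS")
par(op)
```

A common heuristic is to look for an "elbow", the value of k after which tot.withinss stops dropping sharply; on real data the elbow is often not clear-cut.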
Create a shiny app for the author_count data which gives
the user options to decide which visualization technique to use and
to calibrate it with any necessary parameters.
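One possible skeleton for such an app is sketched below. The structure of author_count.csv is not described in this lab, so the sketch assumes a numeric table with named rows and generates a hypothetical stand-in; the choice of techniques, the input names (technique, linkage, k) and the use of base R's heatmap() are likewise assumptions made for illustration.

```r
library(shiny)

# Hypothetical stand-in for author_count.csv (assumed: numeric table, named rows).
# In the lab, replace this with something like
# read.csv("author_count.csv", row.names = 1) after checking the file layout.
set.seed(1)
author_count <- as.data.frame(matrix(rpois(15 * 6, lambda = 20), nrow = 15,
  dimnames = list(paste0("row_", 1:15), paste0("var_", 1:6))))

ui <- fluidPage(
  titlePanel("Clustering explorer"),
  sidebarLayout(
    sidebarPanel(
      selectInput("technique", "Visualization technique",
                  choices = c("Dendrogram", "Heatmap", "K-means on PCA")),
      selectInput("linkage", "Linkage (hierarchical methods)",
                  choices = c("average", "complete", "single")),
      sliderInput("k", "Number of clusters", min = 2, max = 6, value = 4)
    ),
    mainPanel(plotOutput("plot"))
  )
)

server <- function(input, output) {
  output$plot <- renderPlot({
    x <- as.matrix(author_count)
    if (input$technique == "Dendrogram") {
      # Tree with the chosen linkage; boxes mark the k clusters
      h <- hclust(dist(x), method = input$linkage)
      plot(h, xlab = "", sub = "")
      rect.hclust(h, k = input$k)
    } else if (input$technique == "Heatmap") {
      # Base R heatmap using the chosen linkage for both trees
      heatmap(x, scale = "column",
              hclustfun = function(d) hclust(d, method = input$linkage))
    } else {
      # k-means clusters shown on the first two principal components
      km <- kmeans(x, centers = input$k, nstart = 20)
      pc <- prcomp(x)
      plot(pc$x[, 1:2], col = km$cluster, pch = 19)
    }
  })
}

app <- shinyApp(ui, server)
# To launch the app in an interactive session: runApp(app)
```

Each input only affects the techniques that use it, so a natural refinement would be to show the linkage and k controls conditionally.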